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Abstract 

We describe a method for fitting distributions to data which only requires knowl- 
edge of the parametric form of either the signal or the background but not both. 
The unknown distribution is fit using a non-parametric kernel density estima- 
tor. A transformation is used to avoid a problem at the data boundaries. The 
method returns parameter estimates as well as errors on those estimates. Sim- 
" 53 ■ ulation studies show that these estimates are unbiased and that the errors are 

correct. 
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1. Introduction 

Almost all data sets encountered in High Energy Physics (HEP) have events 
that were generated by different mechanisms. Interest often focuses on one 
of these mechanisms, what often is called a signal. By judicious cutting on 
auxiliary variables it is often possible to bring out the signal, that is, to improve 
the signal to noise ratio, but it is usually not possible to eliminate the noise, or 
background events, altogether. In many cases the researchers want to estimate 
the parameters involved in the signal such as the location and the width of an 
invariant mass peak, together with their statistical errors. In order to do so, they 
need to hnd a model for the data, and because the data still contains background 
events the model usually has the form of a mixture distribution. In practice it 
often turns out to be fairly straightforward to find a parametric description for 
the signal density, for example from theoretical considerations. In contrast it is 
often very difficult to model the background density, and the typical approach 
is one of trial and error: fit a number of shapes until a satisfactory one is found. 
This approach has several drawbacks: (i) it is quite time consuming, (ii) different 
researchers might end up with different parametrizations, (iii) it is hard to know 
when to stop in order not to overfit the data and (iv) it is almost impossible to 
gauge the systematic uncertainty of an incorrectly specified background on the 
parameter estimates. 

There are, however, other solutions to the problem of estimating a density, 
namely the so-called nonparametric density (NPD) estimators. A number of 
such methods are known, such as kernel methods, nearest neighbor methods, 
the method of penalized likelihood and others. We could apply any of these 
methods to HEP data but unfortunately they do not yield the parameters that 
are the ultimate goal of the fitting process. 

The semiparametric method that will be discussed here combines traditional 
parametric fitting with nonparametric density estimation. A parametric descrip- 
tion is used for the signal, the background is modeled nonparametrically and 
these two ingredients are combined into one fitting function. 

This method is made especially useful by two features of HEP data. First, 
we often have available a sample of pure background events, either by elimi- 
nating potential signal events by cutting on auxiliary variables or from Monte 
Carlo. This will allow us to employ the most powerful techniques developed 
for NPD estimation, for example the methods for choosing the optimal amount 
of smoothing. Secondly, data in HEP is usually supported on a finite interval. 
This normally would lead to the so-called boundary problem in NPD density es- 
timation, a seemingly innocuous problem for which to date no single solution is 
known. In this work we solve this problem by transforming the data, a solution 
which is viable because in HEP we are not interested in the NPD estimate itself 
but in the signal parameters. As our simulation studies show this solution of 
the boundary problem can be applied in a wide variety of situations and leads 
to correct estimates of the parameters and their errors. 

There is a large volume of literature in the field of Statistics on nonparametric 
density estimation. Standard references include Scott [l|, Silverman Q, Tapia 
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and Thompson [||, Fryer Q and Bean [B[. Kernel density estimation was first 
introduced independently by Rosenblatt 6| and Parzen [7|. 

One of the early discussions of NPD estimation in HEP was by Cranmer 
in 2001, who also discussed their use in multivariate analysis It has since 



been used in a variety of HEP analyses (|l0|, |l2j, [13j]). 



2. Kernel Density Estimation 

Nonparametric density estimation is concerned with estimating a probability 
density without assuming a functional form for the density. The "nonparamet- 
ric" in the name is slightly misleading because these estimators have in fact 
something like parameters. There are a number of different methods that have 
been studied in some detail, such as nearest neighbor estimators, splines, or- 
thogonal series, and so on. Even a histogram (properly scaled) can be viewed 
as a NPD estimator. We will concentrate on the best known and most widely 
used method, the kernel density (KD) estimator. 

Let X\, ..,X n be observations from some unknown density /. Let K be any 
continuous, nonnegative and symmetric function with K(x) — > as x —¥ ±oo. 
Often K is chosen to be a probability density itself, say a Gaussian density. We 
then get the following definition of a nonparametric kernel density estimator: 

i—l 

As has been noted many times, the choice of the kernel function is of minor 
importance (Scott [H). We will use the Gaussian density in this work but other 
kernels could also be chosen. 

The h is a tuning parameter called the bandwidth. It governs the variance- 
bias trade-off, with a small value of h resulting in a very ragged curve whereas 
a large value oversmoothes the density. 

How do we choose the bandwidth hi This has been one of the most active 
areas of research in the field of Statistics in the last 20 years, and a number of 
methods have been proposed. To begin with a criterion for "best" is needed, 
and most research has focused on either the Integrated Squared Error 

ISE(h) = ^ (jh(x) - f{x)fdx (2) 

J — oo 

or the Mean Integrated Squared Error MISE(h) = E[ISE(h)], the expected 
value of ISE(h). Clearly both depend on the unknown density / and there- 
fore have to be estimated from the data. Another criterion is the Asymptotic 
Mean Integrated Squared Error AMISE(h), based on a consideration of the 
large sample behavior of the MISE. Alternatively one can consider the corre- 
sponding Mean Integrated Absolute Value Error criterion (Devroye and Gyorfi 
[14[). Unfortunately, there is no obvious way to choose among these criteria, 
although Jones [HI gives some strong arguments in favor of MISEQi). Detailed 
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discussions on bandwidth selection can be found in Turlach [16| and Chiu [17 1. 
In this paper we will not advocate the use of one specific criterion but rather we 
will vary h over a range of reasonable values, thereby gaining insight into the 
systematic uncertainty introduced by this choice. 

It is intuitively obvious, and can be shown to be mathematically correct, 
that, in regions where the true density changes slowly, the optimal value of 
h should be larger than in regions where / changes rapidly. It is therefore 
reasonable to use an adaptive bandwidth as follows: 



nh *r~i h x 

One suggestion is to use a fixed bandwidth h to get a pilot estimate ft and then 



recompute the density estimate using h x = h/y fh(%) (Abramson [18J). In our 
simulation studies there was very little difference between the adaptive method 
and using a fixed bandwidth, but it is a good idea to investigate this issue in 
any specific situation. 

For the simulation studies shown later we used a simple bandwidth formula 
suggested by Scott flj : let s be the standard deviation of the data and let IQR 
be the interquartile range, then 

h = 1.06 min{s, IQR/lM}n- 1/5 (4) 

The extension to multivariate data is straightforward and usually done using 
a product kernel: 

1 " d x ■ - X*- 

i— 1 j—1 J 



3. The Problem of Boundaries 

One typical feature of the data sets in HEP is that they are truncated, i.e., 
the data are bounded. For univariate data the boundary consists simply of 
the two endpoints of an interval [A, B], Unfortunately nonparametric density 
estimators have a difficult time dealing with boundaries. The reason for this 
is that the lack of data beyond the boundary leads to a smaller estimate just 
inside the boundary and, because they have to integrate to 1, they overestimate 
the density in the middle. 

A number of methods have been proposed to deal with the boundary prob- 
lem. One approach is to modify the kernel near the boundary (Gasser, Miillcr 
and Mammitzsch [l9[ , Jones [2(| )■ The most serious problem with these mod- 
ified boundary kernel methods is that they often lead to negative density esti- 
mates. 

Another idea is to fit local polynomials of low order as discussed in Cheng, 
Fan and Marron j2l|. The argument is that local polynomial estimation would 
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automatically correct for boundary effects in regression, and should therefore 



also work in density estimation (Fan and Gijbels [22J). A boundary correction, 
though, takes place only if the polynomial is of the "correct" order, else it can 
make the boundary effect even larger. Local polynomial fitting has not been as 
successful in density estimation as it is in regression problems, although Zhang 
and Karunamuni |23| improved the performance of this method by combining 
it with a bandwidth variation function. 

Another set of boundary correction methods modifies the bandwidth near 
the boundaries. The basic idea is that a larger bandwidth close to the boundary 
should compensate for the lack of data beyond it. Specific proposals can be 
found in Rice 24 1, Gasser and Miiller 25 1. In contrast Dai and Sperlich 



26| actually advocate a smaller bandwidth. 

One of the early attempts at solving the boundary problem was the reflection 
method, introduced by Schuster [27| and Silverman [2[ and later extended by 
Cline and Hart |28] . A more recent extension of this method is to create pseudo 
data to correct for edges (Cowling and Hall [2^].) This method is more adaptive 
than the common data reflection approach in the sense that it corrects also 
for discontinuities in derivatives of the density. There is also a whole class of 
transformation methods (Wand, Marron and Ruppert [3l[ , Ruppert and Marron 



32j, and Yang [33fj) . In addition, Zhang, Karunamuni and Jones [30( suggested 
a method of generating pseudo data combining the transformation and reflection 
methods. 

Unfortunately, despite the numerous methods that have been proposed, none 
has been shown to yield acceptable results in a wide range of cases. In this 
paper we will take advantage of the fact that our ultimate goal is not density 
estimation but the estimation of parameters such as the number of signal events. 
We can therefore transform the data to the whole real line and simply do the 
fitting there. Note that these are transformations of the data that leave the 
parameters unchanged. 

Say the original observations are X±, .., A„ and they are located in the in- 
terval [A, B]. Then we will calculate new observations Xf, .., Aj with X J = 
H(Xk)', k = 1, ..,n. The transformation H can be any function that is mono- 
tonically increasing, continuous and that maps [A,B] — > (—00,00). We will use 

When we transform the data, what happens to the density of the original 
data and to its parameters? Say the original density was given by f(x) 9) where 
9 is the vector of parameters. Let's denote the density of the transformed data 
by g. Then 

g (y-9) = f{H-\y)-9)-^H-\y) (6) 
ay 

For the transformation above we have 

rr-u n A + B e y d TT _ U , (B-A) e y 
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and © becomes 

A + Be v (R — A)e y 

We can now find the parameter estimates 9 of 9 via maximum likelihood. 
The log- likelihood function 1(9) is given by 

1(6) = J2 l °S9(Xr ;0) = ^>g/pQ;fl) ■ \ > (9) 
»=i »=i (1 + e » J 

and we can see that it has a rather simple form. 

One small problem with the transformation approach is that it makes vi- 
sualization of the fit somewhat difficult. Of course we would like to visualize 
the fit on the original data. To achieve this we need to "back-transform" the 
estimated density. If g(y) is the density estimate of the transformed data at 
the point y, then the corresponding estimate of the original data at the point 
x = H^ 1 (y) is given by 

/(*) = 9 (H( x) ) • !*(*) = g(H(,)) • {x _ B A ~t x) 

and we see that this function has singularities at the boundary points A and 
B, even though it still is a proper density which integrates out to 1. Because 
the fitting is done entirely in the transformed space this does not affect the 
parameter estimates. 

An added benefit to this approach is that the transformed data becomes 
more "Gaussian-like" and as a consequence the formula for the bandwith h (Eq. 
[4j is known to be close to optimal. The use of transformations for this purpose 



was discussed in Hartert [34 1 



4. Semiparametric Fitting 

The idea of mixing parametric and nonparametric components in a model 
is quite common in regression problems but has received fairly little attention 
in density estimation. For some work in the statistics literature see Olkin and 
Spiegelman (39[, Schuster and Yakowitz [4(| and Faraway [4lj ]. 

In HEP numerous papers have used the idea of estimating the background 
non-parametrically, though in general without a consideration of the boundary 
problem. Cranmer [42} presents a general discussion of KDE and suggests using 
the boundary kernel method or the reflection method in case of a boundary. 



Adelman [43j and Meyer [44j are two examples of the implementation of the 
boundary kernel method in a HEP analysis. Our solution to the boundary 
problem, though, is more generally applicable and easier to use. 

Say we have a set of observations (or events) Xi,.,, X n with Xi £ [A, B}. We 
assume that each event was either generated by a background distribution with 
density fs(x), or by a signal distribution with density fs(x; 9). Furthermore, we 
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assume that (1 — a)100% of the events come from the background distribution. 
Finally, we also have a sample of pure background events Yi,..,Y m . 

Notice that the background distribution might also depend on some param- 
eters, but because we will not use the functional form of fs we do not need to 
worry about them. Then the density of the mixing distribution is given by 

f(x;a,6) = (l-a)f B (x) + af s (x;6) (11) 

We can use the maximum likelihood method to find the estimates a and 9 
of a and 9 by maximizing the log-likelihood I (a, 9) given by 

n 

l(a,0) log a, 0) (12) 

i=i 

In our situation we don't really know the functional form of /b, and we would 
therefore like to replace fg by the nonparametric kernel density estimator. To 
do this we first have to transform the data to get new observations Xj , ..,X^ 
and Y± ', .., Y^ as described above. Next we use the background sample to find 
the optimal bandwidth h opt and then we estimate the background density gs 
at the points Xf, ..,X^ using h op t- 

The log-likelihood function then becomes 

l(a, 9)=J2 l0 S I i 1 ~ a )9B(X?) + af s (X t ; 9) ■ {B - A) f ' ) (13) 
i=1 { (l + e A > ) 2 J 

The maximization could now be done using MINUIT [35j or any other opti- 
mization routine. 

If we want to do a hypothesis test for the presence of a signal H : a — vs 
H a : a > we can use the likelihood ratio test with test statistic 

T = 2 (l(a,0) - 1(0,0)) (14) 

Of course, the conditions of Wilks' theorem [36] are not satisfied here. Nev- 
ertheless, quite often T will have an approximate chi-square distribution with 
1 degree of freedom anyway. Also, because it is a large sample theorem, even 
when it does apply one needs to check whether the asymptotic regime has been 
reached. This issue of course applies to the parametric method just as much as 
to the semiparametric one. 

If instead we want to find a confidence interval or an upper bound for the 
parameters, we can find error estimates from the usual large sample theory of 
maximum likelihood estimators (Casella and Berger [381]). These worked quite 
well in the simulation studies we conducted. As an alternative one could find 



the error estimates via the statistical bootstrap (Efron 37]). An example of the 



use of kernel density estimation as well as the bootstrap method in Astrophysics 



is discussed in Adriani 45 1 
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5. Case Study 

As an illustration, consider the following: The background is itself a 50-50 
mixture of a Beta distribution with shape parameters (1, 10) and a uniform 
distribution. This density covers two extreme cases: on the left boundary the 
true density has a large slope and on the right it has slope 0. The signal 
distribution is a Gaussian \i — 0.35 and o = 0.02. 

Figure 1 shows one histogram of 1000 pure background events and another 
of 1000 data events of which 50 are signal events. The other two histograms cor- 
respond to these same two samples but with the variable transformed. For the 
transformed histograms, their respective semiparametric fits are shown as con- 
tinuous curves. Here we fit for both the number of signal events and the signal 
location. The estimates are 51.0 ± 9.8 and 0.354 ± 0.005. (The estimates using 
the the exact parametrization would have been 52.9 ± 9.8 and 0.353 ± 0.005). 
If instead we had used the bootstrap to estimate errors we would have found 
51.0 ± 10.1 and 0.354 ± 0.005. Figure 2 shows the results of a sensitivity study. 
Using the statistical bootstrap and 5 different bandwidth selection methods we 
find that the true optimal bandwidth is between 0.42 and 0.52. Varying h over 
this range we see that neither the estimates nor their errors depend on the exact 
choice of bandwidth h. 

6. Simulation Study 

We will consider six different cases: 

Case 1 background is a 80-20 mixture of a Beta(3,20) and a Beta(3,6). In 
this case the density goes to as x goes to or 1, so no transformation is 
necessary. 

Case 2 background is an exponential distribution with rate 1, truncated at 
x=l. An example of a density with a moderate slope on one side (x=0) 

Case 3 background is a 50-50 mixture of a Beta(l,10) and a uniform [0,1]. 
An example of a density with large slope on one end and slope on the other. 

Case 4 background is a uniform on [0,1] 

Case 5 background is an exponential rate 1. An example with a boundary 
on just one side. For this we use the transformation Y = log(A). 

Case 6 an example of a multivariate problem. The background is a bivari- 
ate normal with means (0,0), standard deviations (0.3,0.5) correlation and 
truncated to [0, l] 2 . 

In cases 1-5 a signal is modeled as a Gaussian mean 0.35 and standard 
deviation 0.02. In case 6 it is a bivariate Gaussian with means (0.35,0.35) 
standard deviations (0.02,0.02) and correlation 0. 

Examples are shown in figures 1 (1000 background events) and figure 2 (950 
background events and 50 signal events). 
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We begin with the test for the presence of a signal. Here for the method 
to work it has to achieve the nominal type I error probability a. We will use 
a = 1% in our study which means we require that the estimated number of signal 
events be positive and that the likelihood ratio test statistic be larger than 5.4. 
We generate lOOOr events each for the pure background sample and 1000 events 
of the " data sample" , that is we assume a ratio of r between the sizes for the 
sideband and the signal region. We use r = 1/2, 1, 2 and 4. Because we are 
testing the null there is no signal present. Wc find the likelihood ratio statistic 
using the three background estimates described above, except for model 6 where 
we include only the exact parametrization and the semiparametric fit. Here the 
signal location and width are fixed so there is no issue of the look-elsewhere 
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effect. This is repeated 25000 times. The curve for the exact parametrization is 
in blue, for the semiparametric method in green and for the false parametrization 
in red. We find 



Model 1 Model 2 Model 3 




i i i i i i i r i i i i i i i r i i i i i i i r 
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Model 4 Model 5 Model 6 




i i i i i i i r i i i i i i i r i i i i i i i r 

0.5 1.5 2.5 3.5 0.5 1.5 2.5 3.5 0.5 1.5 2.5 3.5 



T T T 

We see that using the exact parametrization the true type I error probability 
is just about the nominal one. Using the semiparametric method the true type 
I error probability is sometimes a little higher than the nominal one but ap- 
proaches it for larger background samples. Using the slightly false parametriza- 
tion it can be as much as 10 times the nomial one. 

Next we add a signal drawn from a Gaussian distribution /i = 0.35 and 
a = 0.02. Both data sets have 1000 events. We vary a from 0.5% to 5%, 
equivalent to 5 to 50 signal events. Each simulation run is repeated 2500 times. 
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First, what is the power of the test, that is its ability to detect the signal? 
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Power of Test 
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The power of the test based on the semiparametric fit is almost as good as 
the true parametric fit. Note that the power of the "false" parametrization is 
missleading because it already had a type I error that was to large. 

Are they estimating the signal sizes correctly? Here are the means of the 
MC estimates: 
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x = 0.5 
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T = 2 
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Model 1 



Model 2 



Model 3 





t i i i r 

10 20 30 40 50 
nS 




n i i i r 

10 20 30 40 50 
nS 



Model 4 



Model 5 



Model 6 






Are the error estimates correct? For this we will calculate the 1 standard 
deviation confidence intervals and check whether the percentage of intervals 
containing the true number of signal events is the required 68%: 
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Again the semiparametric method has errors just about correct, whereas the 
false parametrization can yield errors quite wrong. 

Generally the semiparametric method does slightly worse than the exact 
parametrization (which unfortunately in real life is unknown) but better than 
a slightly false but "reasonably" looking one. 



7. Computational Issues 

The work described in this paper was done using the statistics package 
R, available at [http://cran.r-project.org/[ which includes the routine density 
for finding the optimal bandwidth h (with a choice of 5 different methods). 
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This routine was also used to calculate the density estimates. R uses function 
calls to C++ routines which are freely available. The HEP analysis platform 
RooFit, available at http:/ /roofit. sourceforge.net/, includes the KEYS program 
written by Kyle Cranmer. In Perl, an implementation can be found in the 
Statistics-KernelEstimation module. In Java, the Weka package, available at 
|http: / /www.cs.waikato.ac.nz /ml/weka/| provides weka. estimators. KernelEstimator, 
among others. In Gnuplot, kernel density estimation is implemented by the 
smooth kdcnsity option; the data file can contain weight and bandwidth for each 
point, or the bandwidth can be set automatically. C++ code for calculating the 
optimal bandwidth is available at |http : / / www . umiacs . umd .edu/labs/cvl/pirl / vikas| 
and is described in Raykar [47| • Recently developed algorithms for the so-called 
Fast Gauss Transform have made it possible to calculate the NPD estimates 
even for very large data sets. One example is discussed in Yang et al. 



A sample C routine which calculates limits for Model 2 using the semipara- 
metric and the correct parametric background is available from the authors at 
|http: / /charma. uprm.edu/~rolke/publications. htm| . 



8. Conclusions 

We describe a method that uses nonparametric density estimation to esti- 
mate the background density and then finds estimates and errors for the param- 
eters of the signal such as the number of signal events via maximum-likelihood. 
This is made feasible by the availability of a pure background sample. A sim- 
ulation study shows that this method is quite competitive, almost as good as 
using the (in real life unknown) true parametrization and superior to using an 
almost correct parametrization. Moreover, because the nonparametric density 
estimate depends smoothly on the bandwidth, it is easy to do a sensitivity study 
and gain insight into the systematic uncertainty caused by different background 
shapes. 
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